The Wine Quality dataset analyzed here contains 1,599 red wine samples. Each sample comes with a quality rating on a scale from 0 (bad quality) to 10 (high quality). The goal of this project is to discover which chemical properties influence the quality of red wines and to understand how they do so.
## [1] 0
## [1] 1599 13
There are 1599 observations and 13 variables in the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
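The three outputs above (missing-value count, dimensions, structure) are the kind of result that `sum(is.na())`, `dim()`, and `str()` produce. Here is a minimal sketch, assuming a hand-built data frame named `wine_sample` (a hypothetical name) that holds only the first six values of three columns from the listing; the full analysis would load all 1,599 rows from file instead.

```r
# Hypothetical six-row sample copied from the structure listing above;
# the real analysis works on the complete data frame.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

n_missing <- sum(is.na(wine_sample))  # total number of NA cells
shape     <- dim(wine_sample)         # c(rows, columns)

print(n_missing)
print(shape)
str(wine_sample)                      # column types and first values
```

A missing-value count of zero, as reported for the full dataset, means no imputation or row removal is needed before plotting.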
As the histograms above show, density and pH are approximately normally distributed, while the rest of the variables are more or less right skewed. The dependent variable, quality, has an almost normal discrete distribution.
Most wines are rated 5 or 6. Wines of high quality (a rating of 8) are rare, and so are wines of bad quality (ratings of 3 and 4). Rating 7 has almost 200 samples. We are going to investigate these observations further below.
## [0,5) [5,7) [7,10]
## 63 1319 217
##
## Low Medium High
## 63 1319 217
The table above groups wine quality into three categories: Low [0,5), Medium [5,7), and High [7,10]. Most wines clearly fall into the Medium category. The chart above gives me confidence in the data quality: the ratings contain no outliers.
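The binning behind the two tables above can be sketched with `cut()`. This is an illustration under assumptions: the vector holds only the first ten quality scores from the structure listing, and the exact arguments used in the original analysis are not shown, so the ones below are simply chosen to reproduce the interval labels `[0,5)`, `[5,7)`, `[7,10]`.

```r
# First ten quality scores, as listed in the structure output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# right = FALSE makes the intervals closed on the left: [0,5), [5,7), [7,10].
# include.lowest = TRUE closes the last interval so a score of 10 is kept.
quality.category <- cut(quality,
                        breaks = c(0, 5, 7, 10),
                        labels = c("Low", "Medium", "High"),
                        right = FALSE,
                        include.lowest = TRUE)

counts <- table(quality.category)
print(counts)
```

With `right = FALSE`, a score of exactly 5 lands in Medium and a score of exactly 7 lands in High, which matches the category boundaries described above.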
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality quality.category
## 1 5 Medium
## 2 5 Medium
## 3 5 Medium
## 4 6 Medium
## 5 5 Medium
## 6 5 Medium
## quality fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 3 8.360000 0.8845000 0.1710000 2.635000 0.12250000
## 2 5 8.254284 0.5385595 0.2582638 2.503867 0.08897271
## 3 8 8.566667 0.4233333 0.3911111 2.577778 0.06844444
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 11.00000 24.90000 0.9974640 3.398000 0.5700000
## 2 16.36846 48.94693 0.9968673 3.311296 0.6472631
## 3 13.27778 33.44444 0.9952122 3.267222 0.7677778
## alcohol
## 1 9.95500
## 2 10.25272
## 3 12.09444
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The mean quality of the red wines is 5.64 and the median is 6. I suspect there are outliers in free.sulfur.dioxide and total.sulfur.dioxide, because their group means do not follow any sensible pattern with the rating. The wine samples with the highest score have the lowest mean density, volatile acidity, and pH. Residual sugar differs little between the groups in this table.
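A table of per-group means like the one above can be built with `aggregate()`. The sketch below is only an assumption about the approach: it uses the six rows printed earlier, two of the columns, and groups by the raw quality score, since the grouping variable of the original table is not shown.

```r
# Six rows copied from the head() output above (quality, alcohol, pH only).
wine_sample <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Mean of each listed column within each quality score.
group_means <- aggregate(cbind(alcohol, pH) ~ quality,
                         data = wine_sample,
                         FUN = mean)
print(group_means)
```

Swapping `FUN = mean` for `FUN = median` gives a version that is less sensitive to the outliers suspected in the sulfur dioxide columns.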
So now the big question: which factors have an impact on wine rating?
Attributes that increase with rating:
1. Alcohol
2. Fixed acidity
3. Citric acid
4. Sulphates
Attributes that decrease with rating:
1. Density
2. Volatile acidity
3. pH
4. Sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity data is slightly right skewed, with a minimum of 4.6, a maximum of 15.9, a median of 7.9, and a mean of 8.32. The boxplot shows a few outliers from about 12 to 16.
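To see where the boxplot outliers start, the usual 1.5 × IQR rule can be applied to the quartiles in the summary. This is a sketch that assumes the plot uses that conventional rule; R's `boxplot()` computes hinges slightly differently from `summary()` quartiles, so the exact cutoffs may differ a little.

```r
# Quartiles of fixed.acidity as reported in the summary above.
q1 <- 7.10
q3 <- 9.20

iqr         <- q3 - q1            # interquartile range
upper_fence <- q3 + 1.5 * iqr     # values above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr     # values below this are drawn as outliers

print(c(lower = lower_fence, upper = upper_fence))
```

The upper fence comes out at 12.35, which agrees with outliers appearing from about 12 upward. The lower fence of 3.95 is below the minimum of 4.6, so no low outliers are expected.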
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
For volatile acidity, the variability between the low and high quality categories is large compared to the other variables. There are a few outliers in the upper range, around 1.0 to 1.6. The median is 0.52 and the mean is 0.53.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid data is right skewed, with a minimum of 0, a maximum of 1.0 (a single outlier), a median of 0.26, and a mean of 0.27.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar data is right skewed, with a minimum of 0.9 and a maximum of 15.5, and it has many outliers. The median is 2.2 and the mean is 2.54.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides data is right skewed, with a minimum of 0.012, a maximum of 0.611, a median of 0.079, and a mean of 0.087.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide data is right skewed, with a minimum of 1, a maximum of 72, a median of 14, and a mean of 15.87.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide data is right skewed, with a minimum of 6, a maximum of 289 (outliers), a median of 38, and a mean of 46.47.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density data is approximately normal, with a minimum of 0.9901, a maximum of 1.0037, a median of 0.9968, and a mean of 0.9967.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH data is approximately normal, with a minimum of 2.740, a maximum of 4.010, a median of 3.310, and a mean of 3.311.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates data is right skewed, with a minimum of 0.33, a maximum of 2.0, a median of 0.62, and a mean of 0.658.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol data is right skewed but does not have many outliers. The minimum is 8.4, the maximum is 14.9, the median is 10.2, and the mean is 10.42.
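One quick way to cross-check the skew descriptions above is to compare each reported mean with its median, since a mean clearly above the median hints at a right tail. The sketch below uses only the numbers from the summaries; the `rel_gap` column and the 1% threshold are my own arbitrary assumptions, not a formal skewness test.

```r
# Medians and means copied from the summary() outputs above.
reported <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol"),
  median = c(7.90, 0.52, 0.26, 2.20, 0.079, 14, 38,
             0.9968, 3.310, 0.62, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42)
)

# Relative gap between mean and median; positive means the mean is larger.
reported$rel_gap <- (reported$mean - reported$median) / reported$median

# Arbitrary 1% threshold (an assumption) to flag a likely right skew.
reported$right_skew_hint <- reported$rel_gap > 0.01

print(reported[order(-reported$rel_gap), ])
```

Under this rough heuristic only density and pH stay unflagged, which is consistent with describing those two as approximately normal and the rest as right skewed.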
* Less volatile.acidity in a sample results in higher wine quality.
* The higher the citric.acid level in a sample, the better its quality on average. Samples with a citric.acid level above 0.5 are almost never classified as Low quality.
* The higher the sulphates level in a sample, the better its quality on average. However, the sulphates values are less spread out than the values of the other variables.
* Only an alcohol level above 12 gives more certainty that the sample will be considered Medium or High quality. If the alcohol level is below 10, a sample will most likely be considered Medium or Low quality.
The correlation matrix shows that fixed.acidity is highly positively correlated with density and citric.acid. total.sulfur.dioxide is highly positively correlated with free.sulfur.dioxide. pH is highly negatively correlated with fixed.acidity. citric.acid is negatively correlated with volatile.acidity and pH.
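A correlation matrix of this kind can be computed with `cor()`. The sketch below is an illustration only: it uses just the first ten values of three columns from the structure listing, so the coefficients will not match the full-data matrix, and the Pearson method is an assumption.

```r
# First ten values of three columns, copied from the structure listing.
first_ten <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations between all columns.
corr <- cor(first_ten, method = "pearson")
print(round(corr, 2))
```

Even on this tiny sample the signs agree with the findings above: fixed.acidity is positively correlated with citric.acid and negatively correlated with pH.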
## Red_wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## ------------------------------------------------------------
## Red_wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## ------------------------------------------------------------
## Red_wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## ------------------------------------------------------------
## Red_wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## ------------------------------------------------------------
## Red_wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## ------------------------------------------------------------
## Red_wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
The trend between alcohol and quality is clearer, with the highest quality score having the largest median. In other words, the amount of alcohol tends to increase with a better quality ranking. Additionally, most outliers have a score of 5, which is also the one score whose median (9.7) is lower than that of the score below it (10.0 for a score of 4).
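The step-by-step change in the median can be checked directly from the per-score summaries above. This sketch only reuses the reported medians; the names `median_alcohol`, `step_change`, and `drops` are mine.

```r
# Median alcohol for each quality score, copied from the summaries above.
median_alcohol <- c("3" = 9.925, "4" = 10.00, "5" = 9.70,
                    "6" = 10.50, "7" = 11.50, "8" = 12.15)

# Change in median when moving up one quality score.
step_change <- diff(median_alcohol)
print(step_change)

# Scores whose median is lower than that of the score just below them.
drops <- names(median_alcohol)[-1][step_change < 0]
print(drops)
```

Only the step from score 4 to score 5 is negative; every other step up in quality comes with a higher median alcohol content.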
The first scatterplot shows that good quality wines have more fixed acidity and more citric acid. The second scatterplot shows that good wines have low volatile acidity and high citric acid content.
In the scatterplot above, we see that for the same pH value the quality of wine increases as the alcohol content increases.
Alcohol shows the same relationship to quality when plotted against sugar as it does when plotted against the acids.
Every variable's distribution and density were explored from different perspectives: through a histogram, a histogram with a log10 scale, a density chart, and a box plot. About 82% of the wines have an average score (5 or 6). Alcohol, fixed acidity, citric acid, and sulphates increase with a better rating. Density, volatile acidity, pH, and sugar decrease with a better rating.
Most wine samples have a score of 5 or 6 (about 82% of the dataset). Moreover, the wines which received the highest score (8) have only a few observations. The same holds at the low end (3, 4). Wines with a score of 7 have about 200 observations.
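The share of each category follows directly from the counts in the Low/Medium/High table. A small sketch, assuming only those three counts:

```r
# Category counts copied from the Low / Medium / High table above.
counts <- c(Low = 63, Medium = 1319, High = 217)

# Percentage of all samples in each category, rounded to one decimal.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

The Medium category (scores 5 and 6) accounts for roughly 82% of the 1,599 samples, so any model or plot will be dominated by average wines unless the groups are examined separately.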
The trend between alcohol and quality is clearer, with the highest quality score having the largest median. In other words, the amount of alcohol tends to increase with a better quality ranking. Also, most outliers have a score of 5, which is the one score whose median (9.7) is lower than that of the score below it (10.0 for a score of 4).
The red wine data set contains information on almost 1,600 red wine samples across 11 chemical properties. About 82% of the dataset received an average score (5 or 6), and the highest score (8) accounts for only about 1% (18 rows) of observations. The mean was not totally reliable for a few attributes, such as sugar and chlorides, because it differed markedly from the median. In the future, more features could be added to the dataset (country of origin, weather conditions, winemaking process specifics, etc.). The corrplot was crucial for understanding the interactions of the chemicals, which required further research and emphasized the need to understand the basics of the domain in order to perform effective analysis. Another area that required a lot of effort was visualizing the interactions: how best to capture the not-so-obvious relationships between the variables, which can be telling. Finally, this was my first time using the R language, and it was a little hard, especially for someone used to Java and C.